02 - Text Analysis
University of San Francisco, MSMI-608
Outline
Pre-Class Code Assignment Instructions
This semester, I am going to ask you to do a fair bit of work before coming to class. This will make our class time shorter, more manageable, and hopefully less boring.
I am also going to use this as an opportunity for you to directly earn grade points for your effort/labor, rather than “getting things right” on an exam.
Therefore, I will ask you to work through the posted slides on Canvas before class. Throughout the slides, I will post Pre-Class Questions for you to work through in R. These will look like this:
Pre-Class Q1
In R, please write code that will read in the .csv from Canvas called sf_listings_2312.csv. Assign this the name bnb.
You will then write your answer in a .r script:

# Q1
bnb <- read.csv("sf_listings_2312.csv")
Important:
To earn full points, you need to organize your code correctly. Specifically, you need to:
- Answer questions in order.
  - If you answer them out of order, just re-arrange the code after.
- Preface each answer with a comment (# Q1 / # Q2 / # Q3) that indicates exactly which question you are answering.
  - Please just write the letter Q and the number in this comment.
- Make sure your code runs on its own, on anyone's computer.
  - To do this, I would always include rm(list = ls()) at the top of every .r script. This will clear everything from the environment, so you can check that your script runs from scratch, just as it will on my computer.
Handing this in:
- You must submit this to Canvas before 9:00am on the day of class. Even if class starts at 10:00am that day, these are always due at 9:00.
- You must submit this code as a .txt file. This is because Canvas cannot present .R files to me in SpeedGrader. To save as .txt:
  - Click File -> New File -> Text File.
  - Copy and paste your completed code to that new text file.
  - Save the file as firstname_lastname_module.txt
    - For example, my file for Module 01 would be matt_meister_01.txt, and my file for Module 05 would be matt_meister_05.txt.
Grading:
- I will grade these for completion.
- You will receive 1 point for every question you give an honest attempt to answer.
- Your grade will be the number of questions you answer, divided by the total number of questions.
  - This is why it is important that you number each answer with # Q1, # Q2, etc. Any questions that are not numbered this way will be graded incomplete, because I can't find them.
- You will receive a 25% penalty for submitting these late.
- I will post my solutions after class.
Text Analysis
Load in these packages. If you do not have them, you will need to install them.
- e.g., install.packages("tidytext")
library(tidytext)
library(stringr)
library(dplyr)
library(ggplot2)
library(topicmodels)
library(tidyr)
library(Matrix)

Read in the Airbnb listings from last class (as bnb) as well as the reviews (on Canvas):

bnb <- read.csv("sf_listings_2312.csv")
revs <- read.csv('sf_reviews_2312.csv')

We have spent a lot of time with numbers. We have even dabbled a bit in turning text into numbers. For example, whenever we have made dummy codes/indicator variables (e.g., for gender), we are taking words and turning them into numbers.
Text is an extremely useful form of data, especially for us as market researchers. However, it is not always obvious how to take text and turn it into something that we can test–or use to test other things.
In this module, I will briefly introduce four kinds of text analysis:
- Bag-of-words/sentiment
- Topic modeling
- Keywords
- Classification
Unfortunately, we do not have the time to go into great detail on any one of these. Doing so could be an entire class. Therefore, I suggest–if you are interested–looking online at the many, many blogs/walkthroughs you can find about these.
Join Reviews and Listings
We are going to analyze the text of the reviews we have for our Airbnb listings. To do so, it would be nice to have the listing information joined with each of the Airbnb snapshots. We can do this with a join function, from dplyr. There are multiple kinds of joins:
1. Inner Join (inner_join)
- Description: Combines two datasets by returning only the rows with matching keys in both datasets.
- Use case: Use this when you need data that exists in both datasets.
- Example:
inner_join(df1, df2, by = "key_column")
2. Left Join (left_join)
- Description: Returns all rows from the first dataset (df1) and the matching rows from the second dataset (df2). If no match is found, NA is returned for columns from df2.
- Use case: Use this when the focus is on keeping all rows from the left dataset.
- Example:
left_join(df1, df2, by = "key_column")
3. Right Join (right_join)
- Description: Returns all rows from the second dataset (df2) and the matching rows from the first dataset (df1). If no match is found, NA is returned for columns from df1.
- Use case: Use this when the focus is on keeping all rows from the right dataset.
- Example:
right_join(df1, df2, by = "key_column")
4. Full Join (full_join)
- Description: Returns all rows from both datasets. Rows with no match in either dataset will have NA in the missing columns.
- Use case: Use this when you want a complete merge of both datasets, keeping all rows regardless of matching.
- Example:
full_join(df1, df2, by = "key_column")
5. Semi Join (semi_join)
- Description: Returns only the rows from the first dataset (df1) that have a match in the second dataset (df2). It does not add columns from df2.
- Use case: Use this to filter rows in the first dataset based on matching keys in the second dataset.
- Example:
semi_join(df1, df2, by = "key_column")
6. Anti Join (anti_join)
- Description: Returns only the rows from the first dataset (df1) that do not have a match in the second dataset (df2).
- Use case: Use this to find rows in the first dataset that have no match in the second dataset.
- Example:
anti_join(df1, df2, by = "key_column")
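To make the differences concrete, here is a small sketch with two made-up data frames (all names and values here are hypothetical, just for illustration):

```r
library(dplyr)

listings <- data.frame(listing_id = c(1, 2, 3),
                       name = c("Sunny Loft", "Bay View", "Garden Flat"))
reviews  <- data.frame(listing_id = c(1, 1, 2, 4),
                       comment = c("great", "clean", "noisy", "no such listing"))

inner_join(reviews, listings, by = "listing_id")  # 3 rows: only listings 1 and 2 match
left_join(reviews, listings, by = "listing_id")   # 4 rows: listing 4 gets NA for name
anti_join(reviews, listings, by = "listing_id")   # 1 row: the review with no matching listing
```

Run each line and compare the row counts to see what each join keeps and drops.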
We will use inner_join(), because I want to only use reviews for which we have listing information.
In our case, the key column is the listing id. Unfortunately, this is called id in the bnb data frame, and listing_id in the revs data frame. I think the easiest thing is to create a new variable in bnb:
Pre-Class Q1
How can we join revs and bnb?
bnb$listing_id <- bnb$id
revs <- inner_join(revs, bnb, by = 'listing_id', suffix = c('_revs', '_listing'))

suffix = tells R what to put at the end of each column that is in both data frames. This tells us where something came from if it is duplicated.
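Here is a quick sketch of what suffix = does, using two toy data frames that share a name column (the data frames and values are hypothetical):

```r
library(dplyr)

revs_toy <- data.frame(listing_id = 1, name = "Great stay!")  # review text
bnb_toy  <- data.frame(listing_id = 1, name = "Sunny Loft")   # listing name

# Both data frames have a "name" column, so dplyr appends the suffixes:
inner_join(revs_toy, bnb_toy, by = "listing_id",
           suffix = c("_revs", "_listing"))
#   listing_id   name_revs name_listing
# 1          1 Great stay!   Sunny Loft
```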
Cleaning Text
Why?
Text data is often messy, and has a lot of elements we can’t use, such as punctuation, numbers, extra spaces, and special characters. These can obscure underlying patterns, and/or make the data impossible to use. Cleaning the text helps:
- Improve the quality of analysis.
- Ensure uniformity in the text (e.g., all lowercase).
- Prepare the text for further processing, such as tokenization or sentiment analysis.
How?
- Lowercasing: Ensures uniformity by converting all text to lowercase.
- Removing punctuation: Strips out special characters that don’t contribute to the meaning of the text.
- Removing numbers: Eliminates numeric characters if they are not relevant.
- Removing stopwords: Excludes common words (e.g., “and,” “the”) that don’t add much value.
- Trimming whitespace: Removes extra spaces for clean formatting.
Example
The text in revs is contained in the column comments. Below, we will complete each step mentioned above.
Pre-Class Q2
Lowercasing:
revs$clean_text <- tolower(revs$comments)

Pre-Class Q3
Remove punctuation:

revs$clean_text <- str_remove_all(revs$clean_text, "[[:punct:]]")

Pre-Class Q4
Remove numbers:

revs$clean_text <- str_remove_all(revs$clean_text, "\\d*")

Pre-Class Q5
Remove stopwords:

revs$clean_text <- str_remove_all(revs$clean_text,
  paste0("\\b", stop_words$word, "\\b", collapse = "|"))

- When we loaded tidytext, it also loaded the data frame stop_words in the background.
- This contains a bunch of very common words, which don't add much.
What’s the \\b?
The \\b ensures that the pattern matches whole words only, rather than substrings inside larger words.
In our example:
str_remove_all(revs$clean_text, paste0("\\b", stop_words$word, "\\b", collapse = "|"))
The \\b ensures that stopwords like “a” or “the” are removed only when they are standalone words, not parts of other words.
- The word “a” will be removed from “we found a rat in the fridge”, but the letter “a” will not
- Result: “we found rat in the fridge”
- This prevents accidental removal of parts of words, ensuring cleaner and more accurate text processing.
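A small sketch of the difference, using a toy sentence:

```r
library(stringr)

text <- "a cat sat on a mat"
# Without boundaries, every letter "a" is removed, mangling other words:
str_remove_all(text, "a")                    # " ct st on  mt"
# With boundaries, only the standalone word "a" is removed:
str_squish(str_remove_all(text, "\\ba\\b"))  # "cat sat on mat"
```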
Pre-Class Q6
Remove whitespace:
revs$clean_text <- str_squish(revs$clean_text)

Purpose of Each Step
- Convert to Lowercase (tolower()):
  - Text normalization to treat "Hello" and "hello" as the same.
- Remove Punctuation (str_remove_all("[[:punct:]]")):
  - Strips out symbols like !, ?, or . that don't carry semantic meaning.
- Remove Numbers (str_remove_all("\\d*")):
  - Gets rid of numeric values that may not be relevant to the analysis.
- Remove Stopwords (str_remove_all):
  - Uses a pre-defined list of stopwords from the tidytext package.
- Trim Extra Whitespace (str_squish):
  - Cleans up any remaining extra spaces for neatness.
Outcome
Here’s what the updated revs data looks like:
head(revs[,c('comments', 'clean_text')])

comments
1 The bad: Overall felt like a college dorm. The room was very small. the common areas dingy and the carpet and walls dirty. There’s a mildew smell on the air mixed with the many spices of everyone’s dinner. With no air conditioning or fan you have to sleep with the windows open, but the street noise is bad. The front door is an iron behemoth that shakes the whole place with a boom every time it closes. The good: convenient location and good communication
2 We really liked Lily's place! It was very clean and practical for our visit to San Francisco. The location is good, specially if you have a car. Heads-up, don’t park on the side walk. Although we parked exactly where Lily told us, we got a parking ticket.
3 A good place to crash in a great location
4 My partner and I were in SF for three days and feel like this hostel was well worth the price! The bathroom and showers were always clean and its location right outside of Chinatown was so convenient. The front desk was very accommodating and helpful with the check-in and check-out process, even offering to hold our bags for a few hours for the latter. The surrounding area generally felt safe, with my partner and I taking walks around the area until 9-10pm. Great experience overall! Highly recommend, especially given its location.
5 Couldn’t recommend this place enough! Richard and Bina we both so lovely, helpful, and went above and beyond on communication. Very grateful!
6 We had a great stay in Noe Valley! Hosts were so nice and eager to help with anything. Can’t beat the location!
clean_text
1 bad college dorm common dingy carpet walls dirty mildew smell air mixed spices everyones dinner air conditioning fan sleep windows street noise bad front door iron behemoth shakes boom time closes convenient location communication
2 lilys clean practical visit san francisco location specially car headsup dont park walk parked lily told parking ticket
3 crash location
4 partner sf days feel hostel worth price bathroom showers clean location chinatown convenient front desk accommodating helpful checkin checkout process offering hold bags hours surrounding safe partner taking walks pm experience highly recommend location
5 couldnt recommend richard bin lovely helpful communication grateful
6 stay noe valley hosts nice eager beat location
Next, we will work hands-on with Bag-of-Words (BoW) analysis using tidyverse and tidytext, starting with sentiment analysis. The goal is to get practice with BoW and to see how flexible it is.
Bag of Words/Sentiment Analysis
What is Bag-of-Words?
BoW represents text as a collection of words, ignoring grammar and word order, focusing only on the presence (or frequency) of words. Effectively, this is identifying if a word exists in some text. For example, Did this reviewer say “rat”?
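In code, the "Did this reviewer say 'rat'?" check is a one-liner with str_detect() (the example comments below are made up):

```r
library(stringr)

comments <- c("we found a rat in the fridge", "lovely quiet place")
# \\brats?\\b matches "rat" or "rats" as a whole word:
str_detect(comments, "\\brats?\\b")
# [1]  TRUE FALSE
```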
Advantages:
- Simplicity and interpretability.
- Versatile for various analyses.

Disadvantages:
- Ignores word order and context (e.g., sarcasm).
- Can be improved with more sophisticated methods (e.g., word embeddings).
- Use Cases:
- Sentiment analysis.
- Word frequency analysis.
- Text classification and topic modeling.
Sentiment Analysis Example:
We will use the revs dataset to compute the average sentiment score of reviews. Sentiment analysis helps understand customer opinions. It is especially useful when we can’t quantify those opinions in other ways. With reviews, we usually have a star rating, but we actually don’t on Airbnb. So this will help fill in a gap.
Sentiment analysis is also a relatively simple application of BoW. It just assigns positive/negative labels to text.
Steps
- Tokenization: Break text into individual words.
- Join Sentiments: Match words with sentiment scores.
- Compute Scores: Summarize sentiment for each review.
Step 1: Tokenization
We tokenize with the function unnest_tokens() in R. The function takes a data frame, output, and input column. For us, the output is always going to be called “word”. To see how this works, let’s use a simple example. Remember, the text has been cleaned.
two_reviews <- data.frame(
id = c(1,2),
rating = c(2,5),
clean_text = c("disgusting this place was full of rats", "beautiful rats")
)
two_reviews

  id rating clean_text
1 1 2 disgusting this place was full of rats
2 2 5 beautiful rats
two_reviews_tokens <- unnest_tokens(two_reviews, word, clean_text)
two_reviews_tokens

  id rating word
1 1 2 disgusting
2 1 2 this
3 1 2 place
4 1 2 was
5 1 2 full
6 1 2 of
7 1 2 rats
8 2 5 beautiful
9 2 5 rats
Step 2: Join Sentiments
To join sentiments, we have to get a data frame of words and sentiments from somewhere. Luckily, similar to stop_words, there are also sentiment data frames that come with tidytext. We will use the bing lexicon, for no real reason. Feel free to try others (e.g., "afinn" or "nrc").
bing_sentiments <- get_sentiments("bing")
head(bing_sentiments)

# A tibble: 6 × 2
word sentiment
<chr> <chr>
1 2-faces negative
2 abnormal negative
3 abolish negative
4 abominable negative
5 abominably negative
6 abominate negative
bing_sentiments is a data frame of 6786 words, each with either a positive or negative tag. This is essentially our “bag” of words.
Using inner_join(), we can see how many positive and negative words are in each review. We can save this as two_reviews_bing.
two_reviews_tokens |>
  inner_join(bing_sentiments, by = "word")

  id rating word sentiment
1 1 2 disgusting negative
2 2 5 beautiful positive
two_reviews_tokens |>
inner_join(bing_sentiments, by = "word") |>
  count(id, sentiment, sort = TRUE)

  id sentiment n
1 1 negative 1
2 2 positive 1
two_reviews_bing <- two_reviews_tokens |>
inner_join(bing_sentiments, by = "word") |>
  count(id, sentiment, sort = TRUE)

Step 3: Summarize each review
two_reviews_tokens |>
inner_join(bing_sentiments, by = "word") |>
count(id, sentiment, sort = TRUE) |>
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  mutate(sentiment_score = positive - negative)

# A tibble: 2 × 4
id negative positive sentiment_score
<dbl> <int> <int> <int>
1 1 1 0 -1
2 2 0 1 1
And save that summary as two_reviews_sentiment.
two_reviews_sentiment <- two_reviews_tokens |>
inner_join(bing_sentiments, by = "word") |>
count(id, sentiment, sort = TRUE) |>
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  mutate(sentiment_score = positive - negative)

Pre-Class Q7
Now, let’s do all of these steps for the revs data frame.
Perform Tokenization for revs. Save the result as revs_tokens.
revs_tokens <- revs |>
  unnest_tokens(word, clean_text)

Pre-Class Q8
Get sentiments and join them with revs_tokens. Save this as revs_tokens_bing.
bing_sentiments <- get_sentiments("bing")
revs_tokens_bing <- revs_tokens |>
inner_join(bing_sentiments, by = "word") |>
count(id, sentiment, sort = TRUE)
Pre-Class Q9
Summarize revs_tokens_bing. Save this as revs_sentiment.
revs_sentiment <- revs_tokens_bing |>
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  mutate(sentiment_score = positive - negative)

Pre-Class Q10
Join this back to revs, and save it as revs.
revs <- revs |>
  left_join(revs_sentiment, by = "id")

Pre-Class Q11
Visualize the distribution of sentiment.
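One possible answer, assuming you completed Q10 so that revs has a sentiment_score column (this is just a sketch; a density plot or boxplot would also be a fair attempt):

```r
library(ggplot2)

# Reviews with no sentiment words will have NA scores;
# geom_histogram() drops these with a warning.
ggplot(revs, aes(x = sentiment_score)) +
  geom_histogram(binwidth = 1) +
  labs(title = "Distribution of review sentiment",
       x = "Sentiment score (positive - negative words)",
       y = "Number of reviews")
```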
More Bags, More Words
This is a super flexible tool. You can look for anything. But, there are more common things to do. For example, Word Frequency Analysis identifies the most common words in reviews.
Pre-Class Q12
Find the 10 most common words in revs.
word_frequencies <- revs_tokens |>
count(word, sort = TRUE)
head(word_frequencies, 10)

  word n
1 stay 2445
2 br 1382
3 location 1255
4 clean 1213
5 host 931
6 nice 801
7 comfortable 783
8 recommend 736
9 easy 716
10 home 657
Topic Modeling with LDA
Latent Dirichlet Allocation (LDA) is a powerful machine learning method for discovering hidden topics in text data. It groups words that appear together frequently, which you can then label, and use to classify text later. LDA also requires a decent amount of computing power, so I am not going to ask coding questions in this section. I will ask you to discuss my results.
What is Topic Modeling?
Topic modeling is an “unsupervised” machine learning technique that identifies hidden themes or topics in a collection of documents. Unsupervised just means that there is no set outcome. There is no dependent variable. Topic modeling will just show what words appear together often.
You can use this to identify key themes, or to classify text into topics. For example: Airbnb reviews that talk about rats vs Airbnb reviews that talk about crime. The key difference between using topic modeling and bag-of-words for classification is that topic modeling should pick up context better than bag-of-words.
Advantages:
- Unsupervised: No need for labeled data.
- Interpretable: Provides human-readable results.
Disadvantages:
- Requires careful preprocessing.
- May struggle with small datasets or very sparse data.
Implementation
To extract topics, we fit an LDA model to the “document-term matrix”. Specifically, we tell LDA how many topics we want with k =. I am going to ask for four.
The document-term matrix is similar to our data frame of tokens, but is summarized (somewhat) at the document level. For us, the document is each review, and the term is each word. This document-term matrix tells us, for every word that appears in any review, how many times it appears in each specific review.
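To see what this looks like, here is the two-review example from earlier cast into a document-term matrix (a sketch, with the toy counts written out by hand):

```r
library(tidytext)

# Toy token counts: review 1 mentions "disgusting" and "rats";
# review 2 mentions "beautiful" and "rats".
toy_tokens <- data.frame(id   = c(1, 1, 2, 2),
                         word = c("disgusting", "rats", "beautiful", "rats"),
                         n    = c(1, 1, 1, 1))

toy_dtm <- cast_dtm(toy_tokens, document = id, term = word, value = n)
toy_dtm  # a 2 x 3 DocumentTermMatrix: rows are reviews, columns are words,
         # entries are counts (0 when a word is absent from that review)
```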
# Convert to document-term matrix
revs_tokens <- revs |>
unnest_tokens(word, clean_text) |>
count(id, word, sort = TRUE) |> # Count word frequencies by review
ungroup()
dtm <- cast_dtm(revs_tokens, document = id, term = word, value = n)
lda_model <- LDA(dtm, k = 4, control = list(seed = 123))

Once this has run, we can identify the most important words for each topic. I am going to show 10.
# Extract top words for each topic
topics <- tidy(lda_model, matrix = "beta")
top_terms <- topics |>
group_by(topic) |>
slice_max(beta, n = 10) |>
arrange(-beta) |>
ungroup() |>
mutate(term = factor(term))
# View top words
ggplot(top_terms, aes(x = term, y = beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free_y") +
coord_flip() +
labs(
title = "Top 10 Terms for Each Topic",
x = "Terms",
y = "Beta Value (Probability)"
) +
  theme_minimal()

From these results, we would have to interpret the topics ourselves.
Pre-Class Q13
How would you label each of these four topics? Is there an obvious distinction between them?
Pre-Class Q14
We have missed something important in our pre-processing here. What do you notice about the words in each topic that we have missed? Alternatively, do you think we should fix this?
Keyword Analysis with TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) is a technique for identifying keywords that are unique to individual documents. This can be useful to distinguish keywords that tell us something meaningful from words that are merely used often.
What is TF-IDF?
TF-IDF measures the importance of a word in a document relative to all of our documents (which together are called the corpus).

- Formula: tf-idf(t, d) = tf(t, d) × idf(t)
- Term Frequency (TF): Frequency of the word t in document d.
- Inverse Document Frequency (IDF): Logarithm of the ratio of total documents to the number of documents containing t, i.e., idf(t) = log(N / df(t)).
- This highlights words that are frequent in one document but rare across others, identifying keywords unique to each document.
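As a tiny worked example, with hypothetical counts (note that tidytext's bind_tf_idf uses the natural log):

```r
# Suppose "rat" appears 3 times in a 30-word review,
# and in only 1 of 4 reviews overall.
tf  <- 3 / 30      # term frequency within the document: 0.1
idf <- log(4 / 1)  # inverse document frequency: ln(4), about 1.39
tf * idf           # tf-idf: about 0.139
```

A word that appeared in all 4 reviews would have idf = log(4/4) = 0, and hence a tf-idf of 0 no matter how often it occurred.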
Advantages:
- Balances local and global importance of words.
- Useful for keyword extraction, document summarization, and search.
Disadvantages:
- Does not consider word context or semantics.
- Sensitive to noisy or sparse data.
Implementation
Keyword analysis at the level of each review can be helpful, but it is potentially more useful to do this at the neighborhood level, with neighbourhood_cleansed.
To prepare the data for TF-IDF analysis, we have to tokenize and count word frequencies. We will do that within neighbourhood_cleansed.
revs_tokens <- revs |>
unnest_tokens(word, clean_text) |>
count(neighbourhood_cleansed, word, sort = TRUE)
# Total word counts per document (for term frequency calculation)
revs_tokens <- revs_tokens |>
group_by(neighbourhood_cleansed) |>
mutate(total_words = sum(n)) |>
  ungroup()

Then, we can use functions from the tidytext package to calculate the TF-IDF scores for each word in each neighborhood. In this code, I am also going to only show five keywords per neighborhood.
# Compute TF-IDF
tf_idf <- revs_tokens |>
bind_tf_idf(term = word, document = neighbourhood_cleansed, n = n)
tf_idf <- tf_idf |>
arrange(desc(tf_idf)) |>
select(neighbourhood_cleansed, word, tf_idf) |>
group_by(neighbourhood_cleansed) |>
slice_max(tf_idf, n = 5) |>
ungroup()
# Print top keywords
head(tf_idf, 20)

# A tibble: 20 × 3
neighbourhood_cleansed word tf_idf
<chr> <chr> <dbl>
1 Bayview tim 0.0102
2 Bayview jian 0.00880
3 Bayview chris 0.00778
4 Bayview dongmei 0.00635
5 Bayview karl 0.00440
6 Bernal Heights bernal 0.0201
7 Bernal Heights heights 0.00784
8 Bernal Heights tyler 0.00507
9 Bernal Heights ruben 0.00457
10 Bernal Heights cortland 0.00442
11 Castro/Upper Market castro 0.0210
12 Castro/Upper Market dolores 0.00725
13 Castro/Upper Market todd 0.00554
14 Castro/Upper Market ashish 0.00487
15 Castro/Upper Market mission 0.00430
16 Chinatown staff 0.0137
17 Chinatown motel 0.0127
18 Chinatown mary 0.0101
19 Chinatown gainengs 0.00920
20 Chinatown chinatown 0.00658
What makes each neighborhood unique?
Keyword extraction is most useful in distinguishing things. For example, we might want to know what keywords in each neighborhood are different from others. Here is one way we can do that:
Pre-Class Q15
Step 1: Tokenize and count word frequencies by neighborhood
revs_tokens <- revs |>
unnest_tokens(word, clean_text) |>
count(neighbourhood_cleansed, word, sort = TRUE) |>
group_by(neighbourhood_cleansed) |>
mutate(total_words = sum(n)) |>
  ungroup()

Pre-Class Q16
Step 2: Compute TF-IDF to identify terms
tf_idf <- revs_tokens |>
  bind_tf_idf(term = word, document = neighbourhood_cleansed, n = n)

Pre-Class Q17
Step 3: Calculate "proportional frequencies" for comparison across neighborhoods. This shows what share of a neighborhood's review text each word makes up, compared to that word's average share across all neighborhoods.
tf_idf <- tf_idf |>
mutate(proportion = n / total_words) |>
group_by(word) |>
mutate(avg_proportion = mean(proportion),
z_score = (proportion - avg_proportion) / sd(proportion)) |>
  ungroup()

Pre-Class Q18
Step 4: Identify the top distinctive keyword for each neighborhood.
distinctive_keywords <- tf_idf |>
group_by(neighbourhood_cleansed) |>
slice_max(z_score, n = 1) |> # Select most distinctive term
arrange(neighbourhood_cleansed, desc(z_score)) |>
  ungroup()

Pre-Class Q19
Step 5: Interpret the results. What do you notice about these neighborhoods, based on their key words?
print(distinctive_keywords)

# A tibble: 36 × 10
neighbourhood_cleansed word n total_words tf idf tf_idf
<chr> <chr> <int> <int> <dbl> <dbl> <dbl>
1 Bayview pretty 8 1694 0.00472 0.251 0.00119
2 Bernal Heights stores 10 3922 0.00255 0.492 0.00126
3 Castro/Upper Market heart 17 5155 0.00330 0.750 0.00247
4 Chinatown free 5 779 0.00642 0.405 0.00260
5 Crocker Amazon entire 2 899 0.00222 0.639 0.00142
6 Diamond Heights spacious 3 84 0.0357 0.150 0.00534
7 Downtown/Civic Center muy 19 3590 0.00529 0.539 0.00285
8 Excelsior smooth 3 2508 0.00120 0.944 0.00113
9 Financial District price 8 1008 0.00794 0.405 0.00322
10 Glen Park appreciated 5 661 0.00756 0.216 0.00164
# ℹ 26 more rows
# ℹ 3 more variables: proportion <dbl>, avg_proportion <dbl>, z_score <dbl>
Pre-Class Q20
This is going to be hard to do. But, for two points, I want you to try to make a map that has each keyword plotted in its neighborhood.